Perf 使用手册

安装

安装 perf

apt install -y perf

安装 flamegraph

火焰图是脚本，只需要下载:

git clone https://github.com/brendangregg/FlameGraph.git

认识 Perf

下面将由一个最简单的perf命令，初步了解perf的功能：

perf stat -e cycles ls

Performance counter stats for 'ls':

          1040224      cycles

      0.001632667 seconds time elapsed

      0.001925000 seconds user
      0.000000000 seconds sys

上述命令将展示，执行ls这一条命令，cpu运行了多少条指令。

而perf中还能提供更多的此种信息，例如ls命令运行时执行了多少条分支跳转指令，分支跳转预测的成功概率又是多少，在这期间发生了多少次上下文切换，多少次page fault，cache的命中率又如何，等等。

perf list

branch-instructions OR branches                    [Hardware event]
branch-misses                                      [Hardware event]
cache-misses                                       [Hardware event]
... ...

context-switches OR cs                             [Software event]
page-faults OR faults                              [Software event]
... ...

L1-dcache-load-misses                              [Hardware cache event]
L1-dcache-loads                                    [Hardware cache event]
... ...

block:block_bio_backmerge                          [Tracepoint event]
clk:clk_enable                                     [Tracepoint event]
dma_fence:dma_fence_destroy                        [Tracepoint event]
... ...

其中

hardware event 是硬件相关的观测点
software event 是软件相关的观测点
hardware cache event 是cache相关的观测点
tracepoint event 是内核设置的观测点

Perf 功能概览

perf有很多子命令，之前的例子中用到的stat只是其中之一

perf

annotate        Read perf.data (created by perf record) and display annotated code
archive         Create archive with object files with build-ids found in perf.data file
bench           General framework for benchmark suites
buildid-cache   Manage build-id cache.
buildid-list    List the buildids in a perf.data file
c2c             Shared Data C2C/HITM Analyzer.
config          Get and set variables in a configuration file.
data            Data file related processing
diff            Read perf.data files and display the differential profile
evlist          List the event names in a perf.data file
ftrace          simple wrapper for kernel's ftrace functionality
inject          Filter to augment the events stream with additional information
kallsyms        Searches running kernel for symbols
kmem            Tool to trace/measure kernel memory properties
kvm             Tool to trace/measure kvm guest os
list            List all symbolic event types
lock            Analyze lock events
mem             Profile memory accesses
record          Run a command and record its profile into perf.data
report          Read perf.data (created by perf record) and display the profile
sched           Tool to trace/measure scheduler properties (latencies)
script          Read perf.data (created by perf record) and display trace output
stat            Run a command and gather performance counter statistics
test            Runs sanity tests.
timechart       Tool to visualize total system behavior during a workload
top             System profiling tool.
version         display the version of perf binary
probe           Define new dynamic tracepoints

perf的子命令大致可分为2类

基础子命令：用于做一些性能数据的列举，统计和展示
功能型子命令：针对特定的功能进行数据的采样和分析

基础子命令

子命令名	作用
list	列出当前系统支持的所有事件，分为4种类型：hardware event，software event，hardware cache event，tracepoint event
stat	执行某个命令，收集特定进程的性能概况
record	收集采样信息，并将其记录在数据文件中，随后可通过其它工具对数据文件进行分析
report	读取perf record创建的数据文件，并给出分析结果
probe	动态增加tracepoint
script	执行perl或python写的功能扩展脚本、生成脚本框架、读取数据文件中的数据信息等
timechart	分析CPU在一段时间内的使用情况

功能型子命令

子命令名	作用
lock	分析内核中锁的使用情况
kmem	分析内核slab的使用情况
sched	分析内核调度事件的情况
top	top的升级版本，能够看到各个进程实时运行的函数
bench	perf内部的一个benchmark本城
evlist	查看perf.data采样了的事件
kallsyms	判别一个符号是否在当前运行的kernel中导出

Perf 基础子命令

perf list

perf list

branch-instructions OR branches                    [Hardware event]
branch-misses                                      [Hardware event]
bus-cycles                                         [Hardware event]
cache-misses                                       [Hardware event]
cache-references                                   [Hardware event]
cpu-cycles OR cycles                               [Hardware event]
instructions                                       [Hardware event]
ref-cycles                                         [Hardware event]
stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
... ...

alignment-faults                                   [Software event]
bpf-output                                         [Software event]
context-switches OR cs                             [Software event]
cpu-clock                                          [Software event]
cpu-migrations OR migrations                       [Software event]
dummy                                              [Software event]
emulation-faults                                   [Software event]
major-faults                                       [Software event]
minor-faults                                       [Software event]
... ...

L1-dcache-load-misses                              [Hardware cache event]
L1-dcache-loads                                    [Hardware cache event]
L1-dcache-store-misses                             [Hardware cache event]
L1-dcache-stores                                   [Hardware cache event]
LLC-load-misses                                    [Hardware cache event]
LLC-loads                                          [Hardware cache event]
LLC-store-misses                                   [Hardware cache event]
LLC-stores                                         [Hardware cache event]
... ...

block:block_bio_backmerge                          [Tracepoint event]
clk:clk_enable                                     [Tracepoint event]
dma_fence:dma_fence_destroy                        [Tracepoint event]
ext4:ext4_alloc_da_blocks                          [Tracepoint event]
gadget:usb_ep_alloc_request                        [Tracepoint event]
irq:irq_handler_exit                               [Tracepoint event]
irq:softirq_entry                                  [Tracepoint event]
irq:softirq_exit                                   [Tracepoint event]
irq:softirq_raise                                  [Tracepoint event]
jbd2:jbd2_checkpoint                               [Tracepoint event]
kmem:kmalloc                                       [Tracepoint event]
mmc:mmc_request_done                               [Tracepoint event]
neigh:neigh_update_done                            [Tracepoint event]
net:napi_gro_frags_entry                           [Tracepoint event]
nfs4:ff_layout_commit_error                        [Tracepoint event]
oom:compact_retry                                  [Tracepoint event]
page_pool:page_pool_release                        [Tracepoint event]
pagemap:mm_lru_activate                            [Tracepoint event]
percpu:percpu_alloc_percpu                         [Tracepoint event]
power:clock_disable                                [Tracepoint event]
rcu:rcu_utilization                                [Tracepoint event]
regmap:regcache_drop_region                        [Tracepoint event]
regmap:regmap_reg_write                            [Tracepoint event]
rpcgss:rpcgss_bad_seqno                            [Tracepoint event]
sched:sched_kthread_stop                           [Tracepoint event]
signal:signal_deliver                              [Tracepoint event]
skb:consume_skb                                    [Tracepoint event]
smbus:smbus_read                                   [Tracepoint event]
sock:inet_sock_set_state                           [Tracepoint event]
spi:spi_controller_busy                            [Tracepoint event]
swiotlb:swiotlb_bounced                            [Tracepoint event]
task:task_newtask                                  [Tracepoint event]
tcp:tcp_destroy_sock                               [Tracepoint event]
timer:hrtimer_cancel                               [Tracepoint event]
vmscan:mm_shrink_slab_end                          [Tracepoint event]
workqueue:workqueue_activate_work                  [Tracepoint event]
writeback:balance_dirty_pages                      [Tracepoint event]

其中主要分为4种类型的采样事件

hardware event 是硬件相关的观测点
software event 是软件相关的观测点
hardware cache event 是cache相关的观测点
tracepoint event 是内核设置的观测点

perf stat

perf stat子命令可以采样perf list种展示的4种类型的数据，并展示。

同时，perf stat子命令也具有相当多的参数，其中，-e参数用来指定具体的event，是必选的。

perf stat -h

-a, --all-cpus        system-wide collection from all CPUs
-A, --no-aggr         disable CPU count aggregation
-B, --big-num         print large numbers with thousands' separators
-C, --cpu <cpu>       list of cpus to monitor in system-wide
-D, --delay <n>       ms to wait before starting measurement after program start (-1: start with events disabled)
-d, --detailed        detailed run - start a lot of events
-e, --event <event>   event selector. use 'perf list' to list available events
-G, --cgroup <name>   monitor event in cgroup name only
-g, --group           put the counters into a counter group
... ...

实例:

hardware event：

perf stat -e instructions ls

Performance counter stats for 'ls':

           267995      instructions                                                
... ...

可以看到，执行ls，cpu运行了267995条指令

software event：

perf stat -e context-switches,page-faults dmesg

Performance counter stats for 'dmesg':

                6      context-switches                                            
               35      page-faults                                                 
... ...

可以看到，执行dmesg，cpu发生了6次上下文切换，35次page fault

hardware cache event：

perf stat -e L1-dcache-store-misses ls

Performance counter stats for 'ls':

              365      L1-dcache-store-misses                                      
... ...

可以看到，执行ls，cpu发生了365次L1 cache的store miss（写不命中）

tracepoint event：

tracepoint event本身就分为很多种类型，例如：block:block_bio_backmerge，就表示block这个类型的block_bio_backmerge这个采样点。

tracepoint event支持采样类型和采样点非常多，有clk，mmc，sched等类型，而其中每个采样类型又分为更多的采样点，因此，可以通过perf list后面跟上采样类型的名字，来查看某个特性类型的采样点。

perf list sched

sched:sched_kthread_stop                           [Tracepoint event]
sched:sched_kthread_stop_ret                       [Tracepoint event]
sched:sched_migrate_task                           [Tracepoint event]
sched:sched_move_numa                              [Tracepoint event]
sched:sched_pi_setprio                             [Tracepoint event]
sched:sched_process_exec                           [Tracepoint event]
sched:sched_process_exit                           [Tracepoint event]
sched:sched_process_fork                           [Tracepoint event]
sched:sched_process_free                           [Tracepoint event]
sched:sched_process_hang                           [Tracepoint event]
sched:sched_process_wait                           [Tracepoint event]
sched:sched_stat_runtime                           [Tracepoint event]
sched:sched_stick_numa                             [Tracepoint event]
sched:sched_swap_numa                              [Tracepoint event]
sched:sched_switch                                 [Tracepoint event]
sched:sched_wait_task                              [Tracepoint event]
sched:sched_wake_idle_without_ipi                  [Tracepoint event]
sched:sched_wakeup                                 [Tracepoint event]
sched:sched_wakeup_new                             [Tracepoint event]
sched:sched_waking                                 [Tracepoint event]

可以看到，一个sched类型，对应很多采样点。

这些采样点由内核代码中已经写好的代码构成，例如：

2458 static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
2459                struct rq_flags *rf)
2460 {
2461     check_preempt_curr(rq, p, wake_flags);
2462     p->state = TASK_RUNNING;
2463     trace_sched_wakeup(p);
         ... ...

第2463行的trace_sched_wakeup就是一个tracepoint，对应着sched:sched_wakeup这个采样点。

4410 static void __sched notrace __schedule(bool preempt)
4411 {
             ... ...
4525         trace_sched_switch(preempt, prev, next);
             ... ...

第4525行的trace_sched_switch就是一个tracepoint，对应着sched:sched_switch这个采样点。

通过perf stat子命令可以得到观测结果

perf stat -e sched:sched_switch -a sleep 1

Performance counter stats for 'system wide':

               29      sched:sched_switch

可以看到，执行ls，cpu运行了29次sched_switch，也就是对应的__schedule函数。

但看这个效果，貌似tracepoint event功能并不强大，但其实tracepoint event最强大的地方在于时间的统计，内核中每个采样点都代表了一些重要的时刻，比如，进程切换了，调度器把进程放入runqueue，或者开始真正的进程运行，都会加上一个tracepoint，来记录时间，从而为分析和调试提供支持。

perf record & perf report

perf stat子命令展示的是即时的数据，若想要获取更多信息，则需要用perf record子命令将信息组成一个perf.data文件，并利用perf report将其解析并展示出来。

同样的，perf record和perf report子命令也具有相当多的参数，其中，-e参数用来指定具体的event，是必选的

perf record -h

-a, --all-cpus        system-wide collection from all CPUs
-b, --branch-any      sample any taken branches
-B, --no-buildid      do not collect buildids in perf.data
-c, --count <n>       event period to sample
-C, --cpu <cpu>       list of cpus to monitor
-d, --data            Record the sample addresses
-D, --delay <n>       ms to wait before starting measurement after program start (-1: start with events disabled)
-e, --event <event>   event selector. use 'perf list' to list available events
... ...

perf report -h

-b, --branch-stack    use branch records for per branch histogram filling
-c, --comms <comm[,comm...]>
only consider symbols in these comms
-C, --cpu <cpu>       list of cpus to profile
... ...

实例

hardware event：

perf record -e instructions -a ls

perf.data
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (19 samples) ]

可以看到，perf record子命令生成了一个perf.data；继续用perf report解析并展示这个perf.data

perf report

# Samples: 19  of event 'instructions'
# Event count (approx.): 767713
#
# Overhead  Command  Shared Object      Symbol                         
# ........  .......  .................  ...............................
    11.71%  ls       busybox            [.] 0x0000000000018954
    11.67%  ls       [kernel.kallsyms]  [k] fput
    11.64%  ls       [kernel.kallsyms]  [k] ___might_sleep
    10.72%  ls       [kernel.kallsyms]  [k] do_page_fault
     8.83%  perf     [kernel.kallsyms]  [k] free_pages_and_swap_cache
     8.27%  perf     [kernel.kallsyms]  [k] unmap_page_range
     7.54%  perf     [kernel.kallsyms]  [k] page_remove_rmap
     7.14%  perf     [kernel.kallsyms]  [k] __alloc_pages_nodemask
     7.02%  perf     [kernel.kallsyms]  [k] __asm_copy_to_user
     5.85%  perf     [kernel.kallsyms]  [k] dput.part.30
     4.21%  perf     [kernel.kallsyms]  [k] handle_mm_fault
     1.78%  perf     [kernel.kallsyms]  [k] seq_printf
     1.49%  swapper  [kernel.kallsyms]  [k] finish_task_switch
     1.07%  perf     [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     1.07%  perf     [kernel.kallsyms]  [k] fput
     0.00%  perf     [kernel.kallsyms]  [k] generic_exec_single

可以看到，perf report子命令不光展示了cpu执行ls所运行的instruction数，还展示了这些指令数都消耗在哪些进程和函数中。

software event：

perf record -e context-switches -a sleep 1

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (17 samples) ]

# perf report
# Samples: 17  of event 'context-switches'
# Event count (approx.): 36
#
# Overhead  Command          Shared Object     Symbol                
# ........  ...............  ................  ......................
    47.22%  swapper          [unknown]         [.] 0000000000000000
    11.11%  perf             [unknown]         [.] 0000000000000000
     8.33%  kcompactd0       [unknown]         [.] 0000000000000000
     8.33%  kworker/0:1-eve  [unknown]         [.] 0000000000000000
     8.33%  rcu_sched        [unknown]         [.] 0000000000000000
     5.56%  kworker/1:0-mm_  [unknown]         [.] 0000000000000000
     5.56%  sleep            [unknown]         [.] 0000000000000000
     2.78%  migration/0      [unknown]         [.] 0000000000000000
     2.78%  ntpd             [unknown]         [.] 0000000000000000

可以看到，perf report子命令不光展示了cpu执行sleep 1时发生的上下文切换次数，还展示了这切换都分布在哪些进程中。

hardware cache event：

perf record -e L1-dcache-load-misses -a ls

perf.data      perf.data.old
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.018 MB perf.data (3 samples) ]

# perf report
# Samples: 3  of event 'L1-dcache-load-misses'
# Event count (approx.): 4098
#
# Overhead  Command  Shared Object      Symbol                 
# ........  .......  .................  .......................
    99.95%  ls       busybox            [.] 0x00000000000c293c
     0.05%  perf     [kernel.kallsyms]  [k] generic_exec_single

可以看到，perf report子命令不光展示了cpu执行ls时发生的L1 dcache load miss次数，还展示了这些miss都分布在哪些进程和函数中。

tracepoint event：

perf record -e sched:sched_switch -a ls

perf.data      perf.data.old
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.019 MB perf.data (10 samples) ]
# perf report
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 10  of event 'sched:sched_switch'
# Event count (approx.): 10
#
# Overhead  Command      Shared Object     Symbol                
# ........  ...........  ................  ......................
#
    40.00%  swapper      [unknown]         [.] 0000000000000000
    30.00%  perf         [unknown]         [.] 0000000000000000
    10.00%  ls           [unknown]         [.] 0000000000000000
    10.00%  migration/0  [unknown]         [.] 0000000000000000
    10.00%  migration/1  [unknown]         [.] 0000000000000000

可以看到，perf report子命令不光展示了cpu执行ls时发生的运行至sched:sched_switch这个tracepoint的次数，还展示了这tracepoint都分布在哪些进程中。

小结一下，对比stat和record，会发现，stat记录的数据是一种counting（计数）类型的数据，而record记录的是一种sampling（采样）类型的数据；record的原理是依赖定期地中断上报来搜集cpu运行状态的信息，所以负荷也会比stat高一些。

perf probe

有的时候，tracepoint不能满足分析过程中的需求，比如在pick_next_task_fair这个函数里没有任何tracepoint，那该如何对这个函数的调用情况进行分析呢？这时候就可以利用perf probe动态地设置一个tracepoint

perf probe pick_next_task_fair

Added new event:
  probe:pick_next_task_fair (on pick_next_task_fair)

You can now use it in all perf tools, such as:

        perf record -e probe:pick_next_task_fair -aR sleep 1

perf probe --list

  probe:pick_next_task_fair (on pick_next_task_fair)

可以看到，perf probe子命令已经设置了一个新的tracepoint，类型是probe，采样点是pick_next_task_fair

接下来可以通过之前介绍过的perf stat和perf record/report来采样和解析以及展示

perf stat -e probe:pick_next_task_fair -a sleep 1

 Performance counter stats for 'system wide':

                24      probe:pick_next_task_fair

# perf record -e probe:pick_next_task_fair -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.020 MB perf.data (28 samples) ]

# perf report
#
# Samples: 28  of event 'probe:pick_next_task_fair'
# Event count (approx.): 28
#
# Overhead  Command          Shared Object      Symbol                 
# ........  ...............  .................  .......................
    50.00%  swapper          [unknown]          [.] 0xffffffe000245b6e
     7.14%  kcompactd0       [unknown]          [.] 0xffffffe000245b6e
     7.14%  perf             [unknown]          [.] 0xffffffe000245b6e
     7.14%  rcu_sched        [unknown]          [.] 0xffffffe000245b6e
     7.14%  sleep            [unknown]          [.] 0xffffffe000245b6e
     3.57%  khungtaskd       [unknown]          [.] 0xffffffe000245b6e
     3.57%  kworker/0:1-eve  [unknown]          [.] 0xffffffe000245b6e
     3.57%  kworker/1:0-eve  [unknown]          [.] 0xffffffe000245b6e
     3.57%  migration/0      [kernel.kallsyms]  [k] pick_next_task_fair
     3.57%  migration/1      [kernel.kallsyms]  [k] pick_next_task_fair
     3.57%  ntpd             [unknown]          [.] 0xffffffe000245b6e

该功能其实依赖了kprobe，这种动态设置的方法实际上是利用的kprobe的功能，只不过perf将其集成为自己的一个功能项（子命令），使perf本身成为了一个入口（前端），使用更加便捷。

perf script

perf script子命令可以展示最原始的采样细节（比perf record & report还要丰富的细节）。

在通过perf record子命令记录了cpu-cycles的事件后，再利用perf script子命令就可以解析出最原始的采样信息，但这种信息不利于人眼解读

perf 202 [000] 1724.046093:    1 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [000] 1724.046109: 4096 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [000] 1724.046117: 4096 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [000] 1724.046125:11264 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [000] 1724.046166:46273 cycles ffffffe000abc7a8 _raw_spin_unlock_irqrest+0x3c
perf 202 [001] 1724.046257:    1 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [001] 1724.046270: 4096 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [001] 1724.046279: 4096 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [001] 1724.046287:13184 cycles ffffffe000495664 generic_exec_single+0x88
perf 202 [001] 1724.046334:57314 cycles ffffffe0007c66c2 strncpy_from_user+0x40

火焰图

因此perf script主要被用来生成perf.unfold文件，被交给一个名叫FlameGraph（火焰图）的工具，这个工具会解析perf.unfold数据，然后将其转换成易于人类阅读和分析的图形。

下载安装 FlameGrap 工具：

wget "https://github.com/brendangregg/FlameGraph/archive/master.zip"
unzip master.zip

On CPU火焰图

如果希望了解CPU在一段时间内的都运行了哪些函数以及各函数都消耗了多少时间，就可以使用On CPU火焰图，这种火焰图基于cpu-cycles事件进行采样，然后通过svg图片格式展现出来

dd if=/dev/zero of=/tmp/testfile bs=4K count=102400 &
perf record -e cpu-cycles -a -g sleep 1
perf script > perf.unfold
cd FlameGraph-master
./stackcollapse-perf.pl < ../perf.unfold | ./flamegraph.pl > ../perf.svg

首先在后台启动一个dd命令，让它持续运行一段时间，然后开启perf record，记录一秒钟内cpu都运行了多少个cpu-cycles，也就是时间（同时使能-g，就会一并记录运行的函数以及调用关系），再利用perf script命令将perf.data转成perf.unfold，最后利用FlameGraph工具将其转换成一个perf.svg，这是一个图形文件，用浏览器打开后会得到这样一幅图：

此图记录着函数调用关系及其cpu-cycles（时间）占比，就像一缕缕升起的火苗，所以被称之为火焰图。

火焰图还可以通过鼠标点击放大，观察其细节，如下：

Mem火焰图

火焰图当然也可以基于内存，但目前基于内存的事件玄铁910还未支持，所以暂时用page fault事件来代替

perf record -e page-faults -a -g sleep 1
perf script > perf.unfold
cd FlameGraph
./stackcollapse-perf.pl < ../perf.unfold | ./flamegraph.pl --colors=mem > ../perf.svg

用浏览器打开perf.svg，就可以看到page fault发生的分布

值得注意的是，Mem火焰图在生成时会使用--colors=mem参数，来生成一个绿色的火焰图，以示和On CPU火焰图的区别。

如果能基于malloc进行采样的话，就可以分析出一个应用程序调用malloc的分布

但可惜现在还有点bug，正在修复中

Off CPU火焰图

如果要看看CPU对IO的使用情况，可以制作Off CPU火焰图。

但目前Off CPU火焰图现在还有些问题，正在修复中。

Diff火焰图

diff火焰图可以将2个unfold文件进行对比，然后生成一个总的Diff火焰图，这样就可以清晰地看出2个火焰图的差异。

如果需要分析诸如”两种配置参数下性能差异的问题“，这将会是非常好的方法。

使用方法和上述方法无异，只是要将生成的fold文件使用difffold.pl进行输出即可

./difffolded.pl perf.folded1 perf.folded2 | ./flamegraph.pl > diff.svg

perf timechart

火焰图适合分析热点函数，或者指令，但若想分析时序问题（进程/或者cpu在一段时间内都在干什么，例如狂占cpu，或者idle，或者在等待io），就得使用perf timechart子命令。

为了展示timechart的作用，先跑起2个cpu消耗型的测试程序，代码如下：

void *a(void *p)
{
    while (1) {
        for (int i = 0; i < 100000000; i++);
        usleep(200000);
    }
    return NULL;
}

void *b(void *p)
{
    while (1) {
        for (int i = 0; i < 100000000; i++);
        usleep(200000);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, a, NULL);
    pthread_create(&t2, NULL, b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

./test &
perf timechart record -g -a sleep 3
perf timechart

然后启动perf timechart子命令开始记录，最后再用perf timechart子命令生成一个output.svg图，将之用浏览器打开

图中最上方的标记所示解释了每个颜色的含义

接下来两行表示两个cpu的运行状态，如果所示，两个cpu大部分时间里都是处于Running状态的，只有少部分事件是处于Idle

再接下来就是每个进程小号cpu的情况了，如果所示，大部分进程都是灰色的，表示它们都在sleeping，处于一个等待调度的状态，由于系统负载很低，它们即使被调度到了，也不干什么事情，所以看上去就一直处于一个灰色的状态。

综上，perf的基础子命令就介绍完了

总结一下：

perf的list子命令可以展示所有的事件
perf可以通过stat子命令直接展示信息
也可以通过record子命令生成perf.data，然后利用report展示更多的信息，但script命令可以展示最原始的信息，但并易于阅读，所以可以用来生成火焰图
perf的stat子命令也可以读取perf.data（这有点特殊，只在perf kmem子命令中被用到）
perf的probe子命令可以不修改代码地增加观测事件

接下来介绍perf功能型子命令

perf lock

perf lock子命令可以展示一定时间内系统对锁的使用情况

perf lock record -a sleep 1
perf report

# Samples: 38K of event 'lock:lock_acquire'
# Event count (approx.): 38474
    89.15%  perf             [unknown]         [.] 0000000000000000
     8.35%  swapper          [unknown]         [.] 0000000000000000
     1.05%  rcu_sched        [unknown]         [.] 0000000000000000
     0.92%  sleep            [unknown]         [.] 0000000000000000
     0.18%  ksoftirqd/1      [unknown]         [.] 0000000000000000
     0.11%  ntpd             [unknown]         [.] 0000000000000000
     0.06%  kcompactd0       [unknown]         [.] 0000000000000000
     0.05%  haveged          [unknown]         [.] 0000000000000000
     0.04%  kworker/1:0-mm_  [unknown]         [.] 0000000000000000
     0.04%  migration/1      [unknown]         [.] 0000000000000000
     0.03%  kworker/0:1-mm_  [unknown]         [.] 0000000000000000
     0.03%  migration/0      [unknown]         [.] 0000000000000000 
# Samples: 25K of event 'lock:lock_contended'
# Event count (approx.): 25431     
    91.47%  perf             [unknown]         [.] 0000000000000000
     6.49%  swapper          [unknown]         [.] 0000000000000000
     1.05%  rcu_sched        [unknown]         [.] 0000000000000000
     0.66%  sleep            [unknown]         [.] 0000000000000000
     0.08%  ntpd             [unknown]         [.] 0000000000000000
     0.06%  kcompactd0       [unknown]         [.] 0000000000000000
     0.04%  migration/1      [unknown]         [.] 0000000000000000
     0.04%  haveged          [unknown]         [.] 0000000000000000
     0.04%  migration/0      [unknown]         [.] 0000000000000000
     0.02%  kworker/0:1-mm_  [unknown]         [.] 0000000000000000
     0.02%  kworker/1:0-mm_  [unknown]         [.] 0000000000000000
     0.02%  ksoftirqd/1      [unknown]         [.] 0000000000000000
# Samples: 38K of event 'lock:lock_release'
# Event count (approx.): 38448     
    89.14%  perf             [unknown]         [.] 0000000000000000
     8.36%  swapper          [unknown]         [.] 0000000000000000
     1.05%  rcu_sched        [unknown]         [.] 0000000000000000
     0.93%  sleep            [unknown]         [.] 0000000000000000
     0.18%  ksoftirqd/1      [unknown]         [.] 0000000000000000
     0.11%  ntpd             [unknown]         [.] 0000000000000000
     0.06%  kcompactd0       [unknown]         [.] 0000000000000000
     0.05%  haveged          [unknown]         [.] 0000000000000000
     0.04%  migration/1      [unknown]         [.] 0000000000000000
     0.03%  kworker/0:1-mm_  [unknown]         [.] 0000000000000000
     0.03%  migration/0      [unknown]         [.] 0000000000000000
     0.03%  kworker/1:0-mm_  [unknown]         [.] 0000000000000000

perf report展示了1秒钟内，系统的锁获取，锁竞争，锁释放的情况

perf lock子命令，其实就是tracepoint event采样事件中的lock事件

perf list lock

... ...
lock:lock_acquire                                  [Tracepoint event]
lock:lock_acquired                                 [Tracepoint event]
lock:lock_contended                                [Tracepoint event]
lock:lock_release                                  [Tracepoint event]
... ...

像这样的子命令，还有perf kmem和perf sched，但它们又不仅仅是一个简单的wrapper，而是具有更加具有可读性的展示。

perf kmem

perf kmem子命令用于记录内核slab分配器的事件，使用的是tracepoint event中的kmem事件类型

perf kmem record ls
perf kmem stat

SUMMARY (SLAB allocator)
========================
Total bytes requested: 56220
Total bytes allocated: 56824
Total bytes freed:     52968
Net total bytes allocated: 3856
Total bytes wasted on internal fragmentation: 604
Internal fragmentation: 1.062931%
Cross CPU allocations: 0/55

简单的stat输出的信息只是一个概要，但如果使用

perf kmem --alloc --caller --slab stat

就会展示更加丰富的信息

-------------------------------------------------------------
 Callsite | Total_alloc | Total_req | Hit | Ping-pong | Frag
-------------------------------------------------------------
 load_elf_phdrs+6e     | 512/512  | 280/280 | 1 | 0 | 45.312%
 __alloc_skb+98        | 1024/1024| 576/576 | 1 | 0 | 43.750%
 proc_self_get_link+b6 | 16/16    | 11/11   | 1 | 0 | 31.250%
 ext4_readdir+5e8      | 64/64    | 48/48   | 1 | 0 | 25.000%
 alloc_bprm+46         | 1024/512 | 832/416 | 2 | 1 | 18.750%
... ...
--------------------------------------------------------------
--------------------------------------------------------------
--------------------------------------------------------------
 0xffffffe0000c9e00 |  512/512   | 280/280   | 1 | 0 | 45.312%
 0xffffffe00010e000 | 1024/1024  | 576/576   | 1 | 0 | 43.750%
 0xffffffe00201b2b0 |   16/16    |  11/11    | 1 | 0 | 31.250%
 0xffffffe0000dba80 |   64/64    |  48/48    | 1 | 0 | 25.000%
 0xffffffe0000db780 |   64/64    |  50/50    | 1 | 0 | 21.875%
... ...

其中Callsite记录了调用kmalloc的地方，Hit记录了调用的次数，Total_alloc/Per记录了分配的大小，Frag表示碎片占比，Ping-pong记录了申请内存和释放内存是否在同一个cpu上，这对于cache的性能调优会有所助益，Alloc Ptr由--alloc选项导出，记录了申请的内存的虚拟地址。

perf sched

perf sched子命令用于记录内核调度相关的事件，使用的是tracepoint event中的sched事件类型

perf sched record ls

在通过record记录了sched事件后，可以通过report和script来查看，这里就不举例了，来看看perf sched子命令有什么强力的拓展

通过dd命令生成一个文件在eMMC上（这样才能看到明显的延迟，DDR上的操作延迟太小）

perf sched record dd if=/dev/zero of=./testfile bs=4K count=1024
perf sched latency

 ------------------------------------------------------------------------------------
  Task | Runtime ms | Switches | Avg delay | Max delay | Max delay start | Max delay end 
 ------------------------------------------------------------------------------------
  dd:312           |68.054 ms | 2 | 0.210 ms | 0.420 ms | 629.787396 s | 629.787816 s
  sshd:272         | 0.327 ms | 1 | 0.000 ms | 0.000 ms |   0.000000 s |   0.000000 s
  kworker/u4:0-ev:7| 0.125 ms | 1 | 0.000 ms | 0.000 ms |   0.000000 s |   0.000000 s
  migration/1:14   | 0.000 ms | 1 | 0.000 ms | 0.000 ms |   0.000000 s |   0.000000 s
  migration/0:11   | 0.000 ms | 1 | 0.000 ms | 0.000 ms |   0.000000 s |   0.000000 s
 ------------------------------------------------------------------------------------
  TOTAL:                |     68.691 ms |        7 |
 ------------------------------------------------------------------------------------

可以看到，所有dd相关的进程的延迟信息都被展现出来了

perf sched map

# perf sched map
  *A0        1021.655030 secs A0 => migration/0:11
  *.         1021.655100 secs .  => swapper:0
  *B0        1021.655570 secs B0 => perf:334
   B0 *.     1021.655880 secs
  *C0  .     1021.658667 secs C0 => rcu_sched:10
  *B0  .     1021.658719 secs B0 => dd:334
  *C0  .     1021.666567 secs
  *B0  .     1021.666601 secs
  *D0  .     1021.722001 secs D0 => kworker/u4:2-ev:314
  *E0  .     1021.722097 secs E0 => sshd:322
  *B0  .     1021.722419 secs
  *C0  .     1021.722575 secs
  *B0  .     1021.722612 secs
   B0 *F0    1021.722897 secs F0 => perf:333
  *.   F0    1021.722914 secs
   .  *G0    1021.723142 secs G0 => migration/1:14
  *F0  G0    1021.723200 secs
   F0 *.     1021.723219 secs
   F0 *F0    1021.723393 secs

perf sched map展示了调度信息的一个总览，*表示发生调度，点表示该cpu处于idle

perf sched timehist

Samples do not have callchains.
           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
    1021.655030 [0000]  perf[333]                           0.000      0.000      0.000
    1021.655100 [0000]  migration/0[11]                     0.000      0.025      0.069
    1021.655570 [0000]  <idle>                              0.000      0.000      0.470
    1021.655880 [0001]  perf[333]                           0.000      0.000      0.000
    1021.658667 [0000]  dd[334]                             0.000      0.032      3.097
    1021.658719 [0000]  rcu_sched[10]                       0.000      0.116      0.052
    1021.666567 [0000]  dd[334]                             0.052      0.000      7.848
    1021.666601 [0000]  rcu_sched[10]                       7.848      0.049      0.033
    1021.722001 [0000]  dd[334]                             0.033      0.000     55.400
    1021.722097 [0000]  kworker/u4:2-ev[314]                0.000      0.074      0.095
    1021.722419 [0000]  sshd[322]                           0.000      0.059      0.322
    1021.722575 [0000]  dd[334]                             0.418      0.000      0.156
    1021.722612 [0000]  rcu_sched[10]                      55.974      0.035      0.037
    1021.722897 [0001]  <idle>                              0.000      0.000     67.017
    1021.722914 [0000]  dd[334]                             0.037      0.000      0.301
    1021.723142 [0001]  perf[333]                          67.017      0.046      0.245
    1021.723200 [0000]  <idle>                             67.344      0.000      0.286
    1021.723219 [0001]  migration/1[14]                     0.000      0.026      0.077
    1021.723393 [0001]  <idle>                              0.322      0.000      0.173

timehist可以展示这一秒钟内，所有进程的调度情况，wait time表示从被唤醒到开始调度的时间，sch delay表示从开始调度到真正运行的时间。

perf top

perf top子命令动态地显示各种采样事件，以下是2个实例

hardware event之instructions

perf top -e instructions

    10.87%  [kernel]          [k] format_decode
     8.26%  [kernel]          [k] number
     6.86%  [kernel]          [k] kallsyms_expand_symbol.constprop.8
     5.32%  [kernel]          [k] vsnprintf
     3.47%  [kernel]          [k] string_nocheck
     2.95%  perf              [.] 0x00000000000948f0
     2.78%  libc-2.29.so      [.] memcpy
     2.48%  libc-2.29.so      [.] strcmp
     2.38%  [kernel]          [k] update_iter
     2.20%  libc-2.29.so      [.] strchr
     2.10%  libc-2.29.so      [.] __libc_calloc
     1.70%  libc-2.29.so      [.] memset
     1.45%  perf              [.] 0x0000000000082b78
     1.43%  libc-2.29.so      [.] strlen
     1.29%  perf              [.] 0x0000000000082c06
     1.20%  [kernel]          [k] __softirqentry_text_start
     1.15%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.10%  libc-2.29.so      [.] 0x000000000006247e
     1.09%  [kernel]          [k] seq_vprintf
     0.93%  perf              [.] 0x0000000000082b68

tracepoint event之sched:sched_wakeup

perf top -e sched:sched_wakeup

    32.60%  comm=perf pid=416 prio=120 target_cpu=001
    30.39%  comm=perf pid=412 prio=120 target_cpu=001
    17.40%  comm=kworker/0:1H pid=40 prio=100 target_cpu=000
     3.31%  comm=ksoftirqd/0 pid=9 prio=120 target_cpu=000
     3.04%  comm=kcompactd0 pid=26 prio=120 target_cpu=000
     2.76%  comm=ntpd pid=105 prio=120 target_cpu=001
     2.49%  comm=kcompactd0 pid=26 prio=120 target_cpu=001
     2.49%  comm=kworker/0:1 pid=22 prio=120 target_cpu=000
     2.49%  comm=kworker/1:0 pid=16 prio=120 target_cpu=001
     2.49%  comm=perf pid=413 prio=120 target_cpu=000
     0.28%  comm=haveged pid=93 prio=120 target_cpu=000
     0.28%  comm=perf-top-UI pid=417 prio=120 target_cpu=000

如果跟上call-graph功能，perf top子命令将会展示更丰富的函数调用细节

perf top -e instructions --call-graph graph

    16.69%     0.45%  [kernel]            [k] seq_read_iter
            |          
             --0.63%--seq_read_iter
                       |          
                       |--0.98%--s_next
                       |          |          
                       |           --0.89%--update_iter
                       |          
                        --0.64%--s_show

    14.81%     0.18%  [kernel]            [k] s_show
    14.25%     0.16%  [kernel]            [k] seq_printf
    14.20%     0.48%  [kernel]            [k] seq_vprintf
            |          
             --4.51%--seq_vprintf
                       |          
                       |--5.70%--vsnprintf
                       |          |          
                       |          |--4.11%--pointer
                       |          |          |          
                       |          |           --3.56%--number
                       |          |          
                       |          |--2.57%--format_decode
                       |          |          
                       |           --2.36%--string
                       |                     |          
                       |                      --2.10%--string_nocheck
                       |          
                        --1.42%--format_decode

perf bench

perf bench子命令是perf内部集成的一个benchmark测试程序，可以看一下perf bench支持哪些benchmark

perf bench

         sched: Scheduler and IPC benchmarks
       syscall: System call benchmarks
           mem: Memory access benchmarks
         futex: Futex stressing benchmarks
         epoll: Epoll stressing benchmarks
     internals: Perf-internals benchmarks
           all: All benchmarks

可见，benchmark有sched，syscall，mem等几大类

perf bench mem

        memcpy: Benchmark for memcpy() functions
        memset: Benchmark for memset() functions
      find_bit: Benchmark for find_bit() functions
           all: Run all memory access benchmarks

而mem里又包含了3个测试点

perf bench mem memcpy

# Running 'mem/memcpy' benchmark:
# function 'default' (Default memcpy() provided by glibc)
# Copying 1MB bytes ...

       5.778476 GB/sec

perf bench mem memset

# Running 'mem/memset' benchmark:
# function 'default' (Default memset() provided by glibc)
# Copying 1MB bytes ...

       6.300403 GB/sec

perf archive

如果想将采集到的perf.data交给其他人分析，此时仅把perf.data丢给他是不够的，这里就要用到perf archive子命令将perf.data以及相关符号信息打包后，再传给其他人，这样别人就也能分析这个perf.data了。

但目前该功能还有些问题，正在修复中。

perf evlist

perf evlist子命令可以分析出perf.data是基于什么事件采样的

perf evlist

kmem:kmalloc
kmem:kmalloc_node
kmem:kfree
kmem:kmem_cache_alloc
kmem:kmem_cache_alloc_node
kmem:kmem_cache_free
dummy:HG

可以看到，这个是一个基于kmem事件采样的perf.data

perf kallsyms

perf kallsyms子命令可以用于探测一个函数是否存在于内核符号表中，从而决定它是否可以作为一个perf probe的对象

perf kallsyms schedule

schedule: [kernel] [kernel.kallsyms] 0xffffffe000abf44a-0xffffffe000abf52c (0xffffffe000abf44a-0xffffffe000abf52c)

当然，还可以通过

cat /proc/kallsyms | grep schedule

来判断一个函数是不是一个在内核符号表中。

5.4 Perf 使用指南